Diffusion Beats Autoregressive in Data-Constrained Settings

Neural Information Processing Systems

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. Finally, we explain why diffusion models excel in this regime: their randomized masking objective implicitly trains over a rich distribution of token orderings, acting as an implicit data augmentation that AR's fixed left-to-right factorization lacks. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
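The data-augmentation argument in this abstract can be illustrated with a toy count of the conditioning contexts each objective exposes when the same sequence is revisited many times. This is only a sketch, not the paper's training code: the names `ar_contexts` and `masked_context`, the sequence length, and the uniform mask-ratio sampling are all assumptions made for illustration.

```python
import random


def ar_contexts(num_tokens):
    """Left-to-right factorization: position i is always predicted from 0..i-1."""
    return {tuple(range(i)) for i in range(num_tokens)}


def masked_context(num_tokens, rng):
    """One masked-objective draw (assumed scheme): sample a mask ratio,
    mask each position with that probability, return the visible positions."""
    ratio = rng.random()
    masked = {i for i in range(num_tokens) if rng.random() < ratio}
    if not masked:
        # Always mask at least one position so there is something to predict.
        masked = {rng.randrange(num_tokens)}
    return tuple(i for i in range(num_tokens) if i not in masked)


rng = random.Random(0)
num_tokens, epochs = 8, 200  # hypothetical toy sizes

ar_seen = set()
masked_seen = set()
for _ in range(epochs):
    ar_seen |= ar_contexts(num_tokens)            # identical in every epoch
    masked_seen.add(masked_context(num_tokens, rng))  # usually new each epoch

print("distinct AR contexts:", len(ar_seen))
print("distinct masked contexts:", len(masked_seen))
```

Under these assumptions the AR objective revisits the same `num_tokens` prefixes in every epoch, while the masked objective keeps drawing new visible subsets from the same sequence, which is the sense in which repeated data is less redundant for the masked objective.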


Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

Neural Information Processing Systems

Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advance the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model that provides the ground truth conditional score in the latent embedding space at each token position.


Anchored Diffusion Language Model

Neural Information Processing Systems

Diffusion Language Models (DLMs) promise parallel generation and bidirectional context, yet they underperform autoregressive (AR) models in both likelihood modeling and generated text quality. We identify that this performance gap arises when important tokens (e.g., key words or low-frequency words that anchor a sentence) are masked early in the forward process, limiting contextual information for accurate reconstruction. To address this, we introduce the Anchored Diffusion Language Model (ADLM), a novel two-stage framework that first predicts distributions over important tokens via an anchor network, and then predicts the likelihoods of missing tokens conditioned on the anchored predictions. ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches.


grangersearch: An R Package for Exhaustive Granger Causality Testing with Tidyverse Integration

arXiv.org Machine Learning

Understanding causal relationships between time series variables is a fundamental problem in economics, finance, neuroscience, and many other fields. While true causality is philosophically complex and difficult to establish from observational data alone, Granger (1969) proposed a practical, testable notion of causality based on predictability: a variable X is said to "Granger-cause" another variable Y if past values of X contain information that helps predict Y beyond what is contained in past values of Y alone. Granger causality testing has found applications across diverse domains. In macroeconomics, Sims (1972) famously applied the technique to study money-income relationships, while Kraft and Kraft (1978) pioneered its use in energy economics. Financial market researchers including Hiemstra and Jones (1994) have extended the methodology to study price-volume dynamics, and neuroscientists have adapted Granger causality for brain connectivity analysis (Seth, Barrett, and Barnett 2015). The statistical foundations rest on vector autoregressive (VAR) models (Sims 1980), with comprehensive treatments available in Lütkepohl (2005) and discussions of causal interpretation in Peters, Janzing, and Schölkopf (2017). Despite its popularity, implementing Granger causality tests in R (R Core Team 2024) remains cumbersome for applied researchers.
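The predictability definition above can be made concrete as a comparison of two least-squares regressions: one predicting Y from its own lags, and one that also includes lags of X. The sketch below is written in Python with NumPy purely for illustration and does not use or reproduce the grangersearch API. The helper names (`lagged_matrix`, `granger_f_stat`), the single lag, and the simulated series are assumptions.

```python
import numpy as np


def lagged_matrix(series, lags):
    """Columns hold series[t-1], ..., series[t-lags] for t = lags, ..., n-1."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])


def granger_f_stat(x, y, lags=1):
    """F statistic for H0: lags of x add nothing to predicting y beyond lags of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[lags:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged_matrix(y, lags)])          # own lags only
    unrestricted = np.hstack([restricted, lagged_matrix(x, lags)])  # plus lags of x

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Simulated example (assumed data): x drives y with a one-step delay.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

f_xy = granger_f_stat(x, y, lags=1)  # does x help predict y?
f_yx = granger_f_stat(y, x, lags=1)  # does y help predict x?
print(f"F(x -> y) = {f_xy:.1f}, F(y -> x) = {f_yx:.2f}")
```

A large F statistic in one direction and a small one in the other is the pattern the definition describes. A full test would convert the statistic to a p-value using the F distribution with `lags` and `dof` degrees of freedom, and an exhaustive search would repeat this for every ordered pair of variables.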